Discovery of Paradigm Dependencies

Authors

  • Jizhou Sun
  • Jianzhong Li
  • Hong Gao
Abstract

Missing and incorrect values often cause serious consequences. A commonly employed class of tools for dealing with these data quality problems is dependency rules, such as Functional Dependencies (FDs), Conditional Functional Dependencies (CFDs), and Edition Rules (ERs). The stronger the expressive power of a dependency, the better the quality of the data that can be obtained. To the best of our knowledge, all previous dependencies treat each attribute value as an indivisible whole. In many applications, however, part of a value may contain meaningful information, which suggests that more powerful dependency rules for handling data quality problems are possible. In this paper, we consider the discovery of a type of dependency whose left-hand side is part of a regular-expression-like paradigm, named Paradigm Dependencies (PDs). A PD states that if a string matches the paradigm, the element at a specified position determines the value of a certain other attribute. We propose a framework in which strings with similar coding rules and different lengths are clustered together and aligned vertically, from which PDs can be discovered directly. The alignment problem is the key component of this framework and is proved to be NP-Complete. A greedy algorithm is introduced that accomplishes the clustering and aligning tasks simultaneously. Because of the greedy algorithm's high time complexity, several pruning strategies are proposed to reduce the running time. In the experimental study, three real datasets as well as several synthetic datasets are used to verify the effectiveness and efficiency of our methods.
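To make the idea of a PD concrete, the following is a minimal sketch of how a single candidate dependency could be checked against data. It is not the discovery framework or the greedy algorithm described in the abstract. It assumes, purely for illustration, that a paradigm can be written as a Python regular expression with capture groups and that "the element at the specified position" is one of those groups. The function name `pd_violations`, the phone/city records, and the pattern are all hypothetical.

```python
import re
from collections import defaultdict


def pd_violations(records, paradigm, group, lhs_attr, rhs_attr):
    """Check one candidate PD (hypothetical helper, not from the paper).

    Returns the captured elements that co-occur with more than one
    value of rhs_attr, i.e. the elements that violate the dependency.
    """
    pattern = re.compile(paradigm)
    seen = defaultdict(set)
    for rec in records:
        m = pattern.fullmatch(rec[lhs_attr])
        if m is None:
            # The dependency only constrains strings that match the paradigm.
            continue
        seen[m.group(group)].add(rec[rhs_attr])
    return {elem: vals for elem, vals in seen.items() if len(vals) > 1}


# Made-up records: the first three digits of "phone" are assumed to fix "city".
records = [
    {"phone": "010-5551234", "city": "Beijing"},
    {"phone": "010-5559876", "city": "Beijing"},
    {"phone": "021-5554321", "city": "Shanghai"},
    {"phone": "021-5550000", "city": "Beijing"},  # conflicts with the row above
    {"phone": "unknown", "city": "Harbin"},       # does not match, so it is ignored
]

violations = pd_violations(records, r"(\d{3})-(\d{7})", 1, "phone", "city")
print(violations)
```

In this sketch the prefix `021` is reported because it appears with two different cities, while the non-matching string is skipped. Finding the paradigm itself, which the abstract describes as clustering and aligning strings of different lengths, is the hard part and is not attempted here.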


Similar resources

An Approach to Discover Dependencies between Service Operations

Service composition is emerging as an important paradigm for constructing distributed applications by combining and reusing independently developed component services. One key issue of service composition is how to identify relevant service operations so as to compose services rapidly and correctly. A promising approach to simplifying the search of relevant service operations in service composi...


Combinational Circuit Design with Estimation of Distribution Algorithms

The authors introduce new approaches for the combinational circuit design based on Estimation of Distribution Algorithms. In this paradigm, the structure and data dependencies embedded in the data (population of candidate circuits) are modeled by a conditional probability distribution function. The new population is simulated from the probability model thus inheriting the dependencies. The auth...


Applications of a Logical Discovery Engine

The clausal discovery engine CLAUDIEN is presented. CLAUDIEN discovers regularities in data and is a representative of the inductive logic programming paradigm. As such, it represents data and regularities by means of first order clausal theories. Because the search space of clausal theories is larger than that of attribute value representation, CLAUDIEN also accepts as input a declarative sp...


Discover Dependencies from Data - A Review

Functional and inclusion dependency discovery is important to knowledge discovery, database semantics analysis, database design, and data quality assessment. Motivated by the importance of dependency discovery, this paper reviews the methods for functional dependency, conditional functional dependency, approximate functional dependency and inclusion dependency discovery in relational databases ...


Towards agile large-scale predictive modelling in drug discovery with flow-based programming design principles

Predictive modelling in drug discovery is challenging to automate as it often contains multiple analysis steps and might involve cross-validation and parameter tuning that create complex dependencies between tasks. With large-scale data or when using computationally demanding modelling methods, e-infrastructures such as high-performance or cloud computing are required, adding to the existing ch...



Journal:
  • CoRR

Volume abs/1710.02817

Publication date: 2017